Sberbank Russian Housing Market https://www.kaggle.com/c/sberbank-russian-housing-market
Model evaluation: quantifying the quality of predictions http://scikit-learn.org/stable/modules/model_evaluation.html
from IPython.core.display import HTML
hide_code = ''
HTML('''<script> code_show = true;
function code_display() {
if (code_show) {
$('div.input').each(function(id) {if (id == 0 || $(this).html().indexOf('hide_code') > -1) {$(this).hide();}
});
$('div.output_prompt').css('opacity', 0);
} else {
$('div.input').each(function(id) {$(this).show();});
$('div.output_prompt').css('opacity', 1);
}
code_show = !code_show;
}
$(document).ready(code_display);</script>
<form action="javascript: code_display()">
<input style="color: #228B22; background: ghostwhite; opacity: 0.9;"
type="submit" value="Click to display or hide code"></form>''')
hide_code
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import matplotlib.pylab as plt
from random import random
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, median_absolute_error, mean_absolute_error
from sklearn.metrics import r2_score, explained_variance_score
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.ensemble import BaggingRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.linear_model import Ridge, RidgeCV, BayesianRidge
from sklearn.linear_model import HuberRegressor, TheilSenRegressor, RANSACRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler, RobustScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
import keras as ks
from keras.models import Sequential, load_model, Model
from keras.optimizers import SGD, RMSprop
from keras.layers import Dense, Dropout, LSTM
from keras.layers import Activation, Flatten, Input, BatchNormalization
from keras.layers import Conv1D, MaxPooling1D, Conv2D, MaxPooling2D
from keras.layers.embeddings import Embedding
from keras.wrappers.scikit_learn import KerasRegressor
hide_code
def regression(regressor, x_train, x_test, y_train):
reg = regressor
reg.fit(x_train, y_train)
y_train_reg = reg.predict(x_train)
y_test_reg = reg.predict(x_test)
return y_train_reg, y_test_reg
def loss_plot(fit_history):
plt.figure(figsize=(18, 6))
plt.plot(fit_history.history['loss'], color='#348ABD', label = 'train')
plt.plot(fit_history.history['val_loss'], color='#FF7F50', label = 'test')
plt.legend()
plt.title('Loss Function');
def mae_plot(fit_history):
plt.figure(figsize=(18, 6))
plt.plot(fit_history.history['mean_absolute_error'], color='#348ABD', label = 'train')
plt.plot(fit_history.history['val_mean_absolute_error'], color='#FF7F50', label = 'test')
plt.legend()
plt.title('Mean Absolute Error');
def scores(regressor, y_train, y_test, y_train_reg, y_test_reg):
print("_______________________________________")
print(regressor)
print("_______________________________________")
print("EV score. Train: ", explained_variance_score(y_train, y_train_reg))
print("EV score. Test: ", explained_variance_score(y_test, y_test_reg))
print("---------")
print("R2 score. Train: ", r2_score(y_train, y_train_reg))
print("R2 score. Test: ", r2_score(y_test, y_test_reg))
print("---------")
print("MSE score. Train: ", mean_squared_error(y_train, y_train_reg))
print("MSE score. Test: ", mean_squared_error(y_test, y_test_reg))
print("---------")
print("MAE score. Train: ", mean_absolute_error(y_train, y_train_reg))
print("MAE score. Test: ", mean_absolute_error(y_test, y_test_reg))
print("---------")
print("MdAE score. Train: ", median_absolute_error(y_train, y_train_reg))
print("MdAE score. Test: ", median_absolute_error(y_test, y_test_reg))
def scores2(regressor, target, target_predict):
print("_______________________________________")
print(regressor)
print("_______________________________________")
print("EV score:", explained_variance_score(target, target_predict))
print("---------")
print("R2 score:", r2_score(target, target_predict))
print("---------")
print("MSE score:", mean_squared_error(target, target_predict))
print("---------")
print("MAE score:", mean_absolute_error(target, target_predict))
print("---------")
print("MdAE score:", median_absolute_error(target, target_predict))
In this capstone project, we leverage what we have learned throughout the Nanodegree program to solve a problem of our choice by applying machine learning algorithms and techniques. The accompanying proposal encompasses seven key points: domain background, problem statement, datasets and inputs, solution statement, benchmark model, evaluation metrics, and project design.
Housing costs demand a significant investment from both consumers and developers. And when it comes to planning a budget, whether personal or corporate, the last thing anyone needs is uncertainty about one of its largest expenses. Sberbank, Russia's oldest and largest bank, helps its customers by making predictions about realty prices so that renters, developers, and lenders are more confident when they sign a lease or purchase a building.
Although the housing market is relatively stable in Russia, the country's volatile economy makes forecasting prices as a function of apartment characteristics a unique challenge. Complex interactions between housing features, such as the number of bedrooms and the location, are enough to make price predictions complicated. Adding an unstable economy to the mix means Sberbank and its customers need more than simple regression models in their arsenal.
Sberbank is challenging programmers to develop algorithms that use a broad spectrum of features to predict realty prices. Competitors rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model will allow Sberbank to provide more certainty to its customers in an uncertain economy.
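The point about feature interactions can be illustrated on a small synthetic example (hypothetical toy data, not the Sberbank dataset): when the per-meter price depends on the district, a tree ensemble recovers the interaction while a plain linear regression cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(1)
size = rng.uniform(30, 120, 2000)       # apartment area in sq. m (synthetic)
central = rng.randint(0, 2, 2000)       # 1 = central district (synthetic)
price = size * (50 + 150 * central)     # per-meter rate depends on the district
X = np.column_stack([size, central])

# Hold out the last 500 rows and compare R2 on them.
linear_r2 = LinearRegression().fit(X[:1500], price[:1500]).score(X[1500:], price[1500:])
gbr_r2 = GradientBoostingRegressor(random_state=1).fit(X[:1500], price[:1500]).score(X[1500:], price[1500:])
print('Linear R2:', round(linear_r2, 3), ' GBR R2:', round(gbr_r2, 3))
```

The linear model can only add the two effects, so it systematically misprices large central apartments; the boosted trees fit the multiplicative interaction almost exactly.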
hide_code
HTML('''<div id="data">
<p><iframe src="data_dictionary.txt" frameborder="0" height="300" width="97%"></iframe></p>
</div>''')
hide_code
macro = pd.read_csv('macro.csv')
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
hide_code
macro[100:107].T[1:15]
hide_code
train[200:207].T[1:15]
hide_code
test[100:107].T[1:15]
hide_code
X_list_num = ['full_sq', 'num_room', 'floor', 'area_m',
'timestamp',
'preschool_education_centers_raion', 'school_education_centers_raion',
'children_preschool', 'children_school',
'shopping_centers_raion', 'healthcare_centers_raion',
'office_raion', 'sport_objects_raion',
'public_transport_station_min_walk',
'railroad_station_walk_min', 'railroad_station_avto_km', 'bus_terminal_avto_km',
'cafe_count_500',
'kremlin_km', 'workplaces_km',
'ID_metro', 'metro_km_avto', 'metro_min_walk',
'public_healthcare_km', 'shopping_centers_km', 'big_market_km',
'fitness_km', 'swim_pool_km', 'stadium_km', 'park_km',
'kindergarten_km', 'school_km', 'preschool_km', 'university_km', 'additional_education_km',
'theater_km', 'exhibition_km', 'museum_km',
'big_road1_km', 'big_road2_km',
'detention_facility_km', 'cemetery_km', 'oil_chemistry_km', 'radiation_km',
'raion_popul', 'work_all', 'young_all', 'ekder_all']
X_list_cat = ['sub_area', 'ecology', 'big_market_raion']
features_train = train[X_list_num]
features_test = test[X_list_num]
target_train = train['price_doc']
hide_code
plt.style.use('seaborn-whitegrid')
f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(16, 6))
sns.distplot(target_train, bins=200, color='#348ABD', ax=ax1)
ax1.set_xlabel("Prices")
sns.distplot(np.log(target_train), bins=200, color='#348ABD', ax=ax2)
ax2.set_xlabel("Logarithm of the variable Prices")
plt.suptitle('Sberbank Russian Housing Data');
hide_code
print("Sberbank Russian Housing Dataset Statistics: \n")
print("Number of houses = ", len(target_train))
print("Number of features = ", len(list(features_train.keys())))
print("Minimum house price = ", np.min(target_train))
print("Maximum house price = ", np.max(target_train))
print("Mean house price = ", "%.2f" % np.mean(target_train))
print("Median house price = ", "%.2f" % np.median(target_train))
print("Standard deviation of house prices =", "%.2f" % np.std(target_train))
hide_code
features_train.isnull().sum()
hide_code
features_test.isnull().sum()
hide_code
df = pd.DataFrame(features_train, columns=X_list_num)
df['prices'] = target_train
df = df.dropna(subset=['num_room'])
df['metro_min_walk'] = df['metro_min_walk'].interpolate(method='linear')
features_test['metro_min_walk'] = features_test['metro_min_walk'].interpolate(method='linear')
df['railroad_station_walk_min'] = df['railroad_station_walk_min'].interpolate(method='linear')
features_test['railroad_station_walk_min'] = \
features_test['railroad_station_walk_min'].interpolate(method='linear')
df['floor'] = df['floor'].fillna(df['floor'].median())
len(df)
hide_code
ID_metro_cat = pd.factorize(df['ID_metro'])
df['ID_metro'] = ID_metro_cat[0]
# factorize codes run 0..n-1 in order of appearance, so map each unique value to its code
# directly (building the codes via set() would not guarantee a stable order)
ID_metro_pairs = dict(zip(ID_metro_cat[1], range(len(ID_metro_cat[1]))))
ID_metro_pairs[224] = 219
features_test['ID_metro'].replace(ID_metro_pairs, inplace=True)
macro['salary'] = macro['salary'].interpolate(method='linear')
usdrub_pairs = dict(zip(list(macro['timestamp']), list(macro['usdrub'])))
salary_pairs = dict(zip(list(macro['timestamp']), list(macro['salary'])))
df['timestamp'].replace(usdrub_pairs, inplace=True)
features_test['timestamp'].replace(usdrub_pairs, inplace=True)
df.rename(columns={'timestamp' : 'usdrub'}, inplace=True)
features_test.rename(columns={'timestamp' : 'usdrub'}, inplace=True)
hide_code
pearson = df.corr(method='pearson')
corr_with_prices = pearson.iloc[-1][:-1]
corr_with_prices.reindex(corr_with_prices.abs().sort_values(ascending=False).index)
hide_code
features_list2 = corr_with_prices.abs().sort_values(ascending=False)[:32].index.values.tolist()
print(features_list2)
hide_code
target_train = df['prices']
features_train = df.drop('prices', axis=1)
target_train2 = target_train
features_train2 = features_train[features_list2]
features_test2 = features_test[features_list2]
target_train = target_train.values
features_train = features_train.values
features_test = features_test.values
target_train2 = target_train2.values
features_train2 = features_train2.values
features_test2 = features_test2.values
hide_code
X_train, X_test, y_train, y_test = train_test_split(features_train, target_train,
test_size = 0.2, random_state = 1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
hide_code
X_train2, X_test2, y_train2, y_test2 = train_test_split(features_train2, target_train2,
test_size = 0.2, random_state = 1)
X_train2.shape, X_test2.shape, y_train2.shape, y_test2.shape
hide_code
x_scale = RobustScaler()
X_train = x_scale.fit_transform(X_train)
X_test = x_scale.transform(X_test)
x_scale2 = RobustScaler()
X_train2 = x_scale2.fit_transform(X_train2)
X_test2 = x_scale2.transform(X_test2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
hide_code
y_scale = RobustScaler()
s_y_train = y_scale.fit_transform(y_train.reshape(-1,1))
s_y_test = y_scale.transform(y_test.reshape(-1,1))
y_scale2 = RobustScaler()
s_y_train2 = y_scale2.fit_transform(y_train2.reshape(-1,1))
s_y_test2 = y_scale2.transform(y_test2.reshape(-1,1))
s_y_train.shape, s_y_test.shape
hide_code
param_grid_gbr = {'max_depth': [4, 5, 6], 'n_estimators': range(48, 481, 48)}
gridsearch_gbr = GridSearchCV(GradientBoostingRegressor(),
param_grid_gbr, n_jobs=5).fit(X_train, y_train)
gridsearch_gbr.best_params_
hide_code
param_grid_gbr2 = {'max_depth': [3, 4, 5], 'n_estimators': range(32, 321, 32)}
gridsearch_gbr2 = GridSearchCV(GradientBoostingRegressor(),
param_grid_gbr2, n_jobs=5).fit(X_train2, y_train2)
gridsearch_gbr2.best_params_
hide_code
param_grid_br = {'n_estimators': range(48, 481, 48)}
gridsearch_br = GridSearchCV(BaggingRegressor(),
param_grid_br, n_jobs=5).fit(X_train, y_train)
gridsearch_br.best_params_
hide_code
param_grid_br2 = {'n_estimators': range(32, 321, 32)}
gridsearch_br2 = GridSearchCV(BaggingRegressor(),
param_grid_br2, n_jobs=5).fit(X_train2, y_train2)
gridsearch_br2.best_params_
hide_code
y_train_gbr, y_test_gbr = regression(GradientBoostingRegressor(max_depth=4, n_estimators=240),
X_train, X_test, y_train)
y_train_br, y_test_br = regression(BaggingRegressor(n_estimators=384),
X_train, X_test, y_train)
hide_code
print('48 features')
scores('GradientBoostingRegressor', y_train, y_test, y_train_gbr, y_test_gbr)
scores('BaggingRegressor', y_train, y_test, y_train_br, y_test_br)
hide_code
y_train_gbr2, y_test_gbr2 = regression(GradientBoostingRegressor(max_depth=4, n_estimators=288),
X_train2, X_test2, y_train2)
y_train_br2, y_test_br2 = regression(BaggingRegressor(n_estimators=128),
X_train2, X_test2, y_train2)
hide_code
print('32 features')
scores('GradientBoostingRegressor', y_train2, y_test2, y_train_gbr2, y_test_gbr2)
scores('BaggingRegressor', y_train2, y_test2, y_train_br2, y_test_br2)
hide_code
mlpr = MLPRegressor(hidden_layer_sizes=(240,), max_iter=200, solver='lbfgs',
alpha=0.01, verbose=2)
mlpr.fit(X_train, y_train)
y_train_mlpr = mlpr.predict(X_train)
y_test_mlpr = mlpr.predict(X_test)
scores('MLP Regressor #1', y_train, y_test, y_train_mlpr, y_test_mlpr)
hide_code
mlpr2 = MLPRegressor(hidden_layer_sizes=(288,), max_iter=300, solver='lbfgs',
alpha=0.01, verbose=2)
mlpr2.fit(X_train2, y_train2)
y_train_mlpr2 = mlpr2.predict(X_train2)
y_test_mlpr2 = mlpr2.predict(X_test2)
scores('MLP Regressor #2', y_train2, y_test2, y_train_mlpr2, y_test_mlpr2)
hide_code
plt.figure(figsize = (18, 6))
plt.plot(y_test[1:50], color = 'black', label='Real Data')
plt.plot(y_test_gbr[1:50], label='Gradient Boosting')
plt.plot(y_test_br[1:50], label='Bagging Regressor')
plt.plot(y_test_mlpr[1:50], label='MLP Regressor')
plt.legend()
plt.title("48 Features; Regressor Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(y_test2[1:50], color = 'black', label='Real Data')
plt.plot(y_test_gbr2[1:50], label='Gradient Boosting')
plt.plot(y_test_br2[1:50], label='Bagging Regressor')
plt.plot(y_test_mlpr2[1:50], label='MLP Regressor')
plt.legend()
plt.title("32 Features; Regressor Predictions vs Real Data");
hide_code
def mlp_model():
model = Sequential()
model.add(Dense(48, activation='relu', input_dim=48))
model.add(Dense(48, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(192, activation='relu'))
model.add(Dense(192, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(768, activation='relu'))
model.add(Dense(768, activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
return model
mlp_model = mlp_model()
mlp_history = mlp_model.fit(X_train, s_y_train, validation_data=(X_test, s_y_test),
                            epochs=20, batch_size=16, verbose=0)
hide_code
loss_plot(mlp_history)
mae_plot(mlp_history)
hide_code
s_y_train_mlp = mlp_model.predict(X_train)
s_y_test_mlp = mlp_model.predict(X_test)
scores('MLP Model #1', s_y_train, s_y_test, s_y_train_mlp, s_y_test_mlp)
hide_code
mlp_model.save('mlp_model_p6.h5')
hide_code
def mlp_model2():
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=32))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(128, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(512, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(1))
model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
return model
mlp_model2 = mlp_model2()
mlp_history2 = mlp_model2.fit(X_train2, s_y_train2, validation_data=(X_test2, s_y_test2),
                              epochs=40, batch_size=16, verbose=0)
hide_code
loss_plot(mlp_history2)
mae_plot(mlp_history2)
hide_code
s_y_train_mlp2 = mlp_model2.predict(X_train2)
s_y_test_mlp2 = mlp_model2.predict(X_test2)
scores('MLP Model #2', s_y_train2, s_y_test2, s_y_train_mlp2, s_y_test_mlp2)
hide_code
mlp_model2.save('mlp_model2_p6.h5')
hide_code
def cnn_model():
model = Sequential()
model.add(Conv1D(48, 5, padding='valid', activation='relu', input_shape=(48, 1)))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))
model.add(Conv1D(192, 3, padding='valid', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(768, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, kernel_initializer='normal'))
# opt = keras.optimizers.rmsprop(decay=1e-6)
model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
return model
cnn_model = cnn_model()
cnn_history = cnn_model.fit(X_train.reshape(16719, 48, 1), s_y_train,
epochs=25, batch_size=64, verbose=0,
validation_data=(X_test.reshape(4180, 48, 1), s_y_test))
hide_code
loss_plot(cnn_history)
mae_plot(cnn_history)
hide_code
s_y_train_cnn = cnn_model.predict(X_train.reshape(16719, 48, 1))
s_y_test_cnn = cnn_model.predict(X_test.reshape(4180, 48, 1))
scores('CNN Model #1', s_y_train, s_y_test, s_y_train_cnn, s_y_test_cnn)
hide_code
cnn_model.save('cnn_model_p6.h5')
hide_code
def cnn_model2():
model = Sequential()
model.add(Conv1D(32, 5, padding='valid', activation='relu', input_shape=(32, 1)))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))
model.add(Conv1D(128, 3, padding='valid', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(512, kernel_initializer='normal', activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, kernel_initializer='normal'))
# opt = keras.optimizers.rmsprop(decay=1e-6)
model.compile(loss='mse', optimizer='rmsprop', metrics=['mae'])
return model
cnn_model2 = cnn_model2()
cnn_history2 = cnn_model2.fit(X_train2.reshape(16719, 32, 1), s_y_train2,
epochs=30, batch_size=16, verbose=0,
validation_data=(X_test2.reshape(4180, 32, 1), s_y_test2))
hide_code
loss_plot(cnn_history2)
mae_plot(cnn_history2)
hide_code
s_y_train_cnn2 = cnn_model2.predict(X_train2.reshape(16719, 32, 1))
s_y_test_cnn2 = cnn_model2.predict(X_test2.reshape(4180, 32, 1))
scores('CNN Model #2', s_y_train2, s_y_test2, s_y_train_cnn2, s_y_test_cnn2)
hide_code
cnn_model2.save('cnn_model2_p6.h5')
hide_code
def rnn_model():
model = Sequential()
model.add(LSTM(192, return_sequences=True, input_shape=(1, 48)))
model.add(LSTM(768, return_sequences=False))
model.add(Dense(1))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
return model
rnn_model = rnn_model()
rnn_history = rnn_model.fit(X_train.reshape(16719, 1, 48), s_y_train.reshape(16719),
epochs=8, verbose=0,
validation_data=(X_test.reshape(4180, 1, 48), s_y_test.reshape(4180)))
hide_code
loss_plot(rnn_history)
mae_plot(rnn_history)
hide_code
s_y_train_rnn = rnn_model.predict(X_train.reshape(16719, 1, 48))
s_y_test_rnn = rnn_model.predict(X_test.reshape(4180, 1, 48))
scores('RNN Model #1', s_y_train, s_y_test, s_y_train_rnn, s_y_test_rnn)
hide_code
rnn_model.save('rnn_model_p6.h5')
hide_code
def rnn_model2():
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(1, 32)))
model.add(LSTM(512, return_sequences=False))
model.add(Dense(1))
model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
return model
rnn_model2 = rnn_model2()
rnn_history2 = rnn_model2.fit(X_train2.reshape(16719, 1, 32), s_y_train2,
epochs=8, verbose=2,
validation_data=(X_test2.reshape(4180, 1, 32), s_y_test2))
hide_code
loss_plot(rnn_history2)
mae_plot(rnn_history2)
hide_code
s_y_train_rnn2 = rnn_model2.predict(X_train2.reshape(16719, 1, 32))
s_y_test_rnn2 = rnn_model2.predict(X_test2.reshape(4180, 1, 32))
scores('RNN Model #2', s_y_train2, s_y_test2, s_y_train_rnn2, s_y_test_rnn2)
hide_code
rnn_model2.save('rnn_model2_p6.h5')
hide_code
y_train_mlp = y_scale.inverse_transform(s_y_train_mlp)
y_test_mlp = y_scale.inverse_transform(s_y_test_mlp)
y_train_cnn = y_scale.inverse_transform(s_y_train_cnn)
y_test_cnn = y_scale.inverse_transform(s_y_test_cnn)
y_train_rnn = y_scale.inverse_transform(s_y_train_rnn)
y_test_rnn = y_scale.inverse_transform(s_y_test_rnn)
##########################################################
y_train_mlp2 = y_scale2.inverse_transform(s_y_train_mlp2)
y_test_mlp2 = y_scale2.inverse_transform(s_y_test_mlp2)
y_train_cnn2 = y_scale2.inverse_transform(s_y_train_cnn2)
y_test_cnn2 = y_scale2.inverse_transform(s_y_test_cnn2)
y_train_rnn2 = y_scale2.inverse_transform(s_y_train_rnn2)
y_test_rnn2 = y_scale2.inverse_transform(s_y_test_rnn2)
hide_code
plt.figure(figsize = (18, 6))
plt.plot(y_test[1:50], color = 'black', label='Real Data')
plt.plot(y_test_mlp[1:50], label='MLP')
plt.plot(y_test_cnn[1:50], label='CNN')
plt.plot(y_test_rnn[1:50], label='RNN')
plt.legend()
plt.title("48 Features; Neural Network Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(y_test2[1:50], color = 'black', label='Real Data')
plt.plot(y_test_mlp2[1:50], label='MLP')
plt.plot(y_test_cnn2[1:50], label='CNN')
plt.plot(y_test_rnn2[1:50], label='RNN')
plt.legend()
plt.title("32 Features; Neural Network Predictions vs Real Data");
hide_code
feature_scale = RobustScaler()
s_features_train = feature_scale.fit_transform(features_train)
s_features_test = feature_scale.transform(features_test)
target_scale = RobustScaler()
s_target_train = target_scale.fit_transform(target_train.reshape(-1, 1))
##################################################################
feature_scale2 = RobustScaler()
s_features_train2 = feature_scale2.fit_transform(features_train2)
s_features_test2 = feature_scale2.transform(features_test2)
target_scale2 = RobustScaler()
s_target_train2 = target_scale2.fit_transform(target_train2.reshape(-1, 1))
hide_code
gbr = GradientBoostingRegressor(max_depth=4, n_estimators=240)
gbr.fit(s_features_train, target_train)
target_train_predict_gbr = gbr.predict(s_features_train)
target_test_predict_gbr = gbr.predict(s_features_test)
scores2('Gradient Boosting Regressor', target_train, target_train_predict_gbr)
hide_code
br = BaggingRegressor(n_estimators=384)
br.fit(s_features_train, target_train)
target_train_predict_br = br.predict(s_features_train)
target_test_predict_br = br.predict(s_features_test)
scores2('Bagging Regressor', target_train, target_train_predict_br)
hide_code
target_train_predict_mlpr = mlpr.predict(s_features_train)
target_test_predict_mlpr = mlpr.predict(s_features_test)
scores2('MLP Regressor', target_train, target_train_predict_mlpr)
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_train[1:50], color = 'black', label='Real Data')
plt.plot(target_train_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr[1:50], label='MLP Regressor')
plt.legend()
plt.title("48 Features; Regressor Train Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_test_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr[1:50], label='MLP Regressor')
plt.legend()
plt.title("48 Features; Regressor Test Predictions");
hide_code
gbr2 = GradientBoostingRegressor(max_depth=4, n_estimators=288)
gbr2.fit(s_features_train2, target_train2)
target_train_predict_gbr2 = gbr2.predict(s_features_train2)
target_test_predict_gbr2 = gbr2.predict(s_features_test2)
scores2('Gradient Boosting Regressor', target_train2, target_train_predict_gbr2)
hide_code
br2 = BaggingRegressor(n_estimators=128)
br2.fit(s_features_train2, target_train2)
target_train_predict_br2 = br2.predict(s_features_train2)
target_test_predict_br2 = br2.predict(s_features_test2)
scores2('Bagging Regressor', target_train2, target_train_predict_br2)
hide_code
target_train_predict_mlpr2 = mlpr2.predict(s_features_train2)
target_test_predict_mlpr2 = mlpr2.predict(s_features_test2)
scores2('MLP Regressor', target_train2, target_train_predict_mlpr2)
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_train2[1:50], color = 'black', label='Real Data')
plt.plot(target_train_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr2[1:50], label='MLP Regressor')
plt.legend()
plt.title("32 Features; Regressor Train Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_test_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr2[1:50], label='MLP Regressor')
plt.legend()
plt.title("32 Features; Regressor Test Predictions");
hide_code
s_target_train_predict_mlp = mlp_model.predict(s_features_train)
s_target_test_predict_mlp = mlp_model.predict(s_features_test)
scores2('MLP #1', s_target_train, s_target_train_predict_mlp)
hide_code
s_target_train_predict_cnn = cnn_model.predict(s_features_train.reshape(20899, 48, 1))
s_target_test_predict_cnn = cnn_model.predict(s_features_test.reshape(7662, 48, 1))
scores2('CNN #1', s_target_train, s_target_train_predict_cnn)
hide_code
s_target_train_predict_rnn = rnn_model.predict(s_features_train.reshape(20899, 1, 48))
s_target_test_predict_rnn = rnn_model.predict(s_features_test.reshape(7662, 1, 48))
scores2('RNN #1', s_target_train, s_target_train_predict_rnn)
hide_code
target_train_predict_mlp = target_scale.inverse_transform(s_target_train_predict_mlp)
target_test_predict_mlp = target_scale.inverse_transform(s_target_test_predict_mlp)
target_train_predict_cnn = target_scale.inverse_transform(s_target_train_predict_cnn)
target_test_predict_cnn = target_scale.inverse_transform(s_target_test_predict_cnn)
target_train_predict_rnn = target_scale.inverse_transform(s_target_train_predict_rnn)
target_test_predict_rnn = target_scale.inverse_transform(s_target_test_predict_rnn)
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_train[1:50], color = 'black', label='Real Data')
plt.plot(target_train_predict_mlp[1:50], label='MLP')
plt.plot(target_train_predict_cnn[1:50], label='CNN')
plt.plot(target_train_predict_rnn[1:50], label='RNN')
plt.legend()
plt.title("48 Features; Neural Network Train Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_test_predict_mlp[1:50], label='MLP')
plt.plot(target_test_predict_cnn[1:50], label='CNN')
plt.plot(target_test_predict_rnn[1:50], label='RNN')
plt.legend()
plt.title("48 Features; Neural Network Test Predictions");
hide_code
s_target_train_predict_mlp2 = mlp_model2.predict(s_features_train2)
s_target_test_predict_mlp2 = mlp_model2.predict(s_features_test2)
scores2('MLP #2', s_target_train2, s_target_train_predict_mlp2)
hide_code
s_target_train_predict_cnn2 = cnn_model2.predict(s_features_train2.reshape(20899, 32, 1))
s_target_test_predict_cnn2 = cnn_model2.predict(s_features_test2.reshape(7662, 32, 1))
scores2('CNN #2', s_target_train2, s_target_train_predict_cnn2)
hide_code
s_target_train_predict_rnn2 = rnn_model2.predict(s_features_train2.reshape(20899, 1, 32))
s_target_test_predict_rnn2 = rnn_model2.predict(s_features_test2.reshape(7662, 1, 32))
scores2('RNN #2', s_target_train2, s_target_train_predict_rnn2)
hide_code
target_train_predict_mlp2 = target_scale2.inverse_transform(s_target_train_predict_mlp2)
target_test_predict_mlp2 = target_scale2.inverse_transform(s_target_test_predict_mlp2)
target_train_predict_cnn2 = target_scale2.inverse_transform(s_target_train_predict_cnn2)
target_test_predict_cnn2 = target_scale2.inverse_transform(s_target_test_predict_cnn2)
target_train_predict_rnn2 = target_scale2.inverse_transform(s_target_train_predict_rnn2)
target_test_predict_rnn2 = target_scale2.inverse_transform(s_target_test_predict_rnn2)
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_train2[1:50], color = 'black', label='Real Data')
plt.plot(target_train_predict_mlp2[1:50], label='MLP')
plt.plot(target_train_predict_cnn2[1:50], label='CNN')
plt.plot(target_train_predict_rnn2[1:50], label='RNN')
plt.legend()
plt.title("32 Features; Neural Network Train Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_test_predict_mlp2[1:50], label='MLP')
plt.plot(target_test_predict_cnn2[1:50], label='CNN')
plt.plot(target_test_predict_rnn2[1:50], label='RNN')
plt.legend()
plt.title("32 Features; Neural Network Test Predictions");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_train[1:50], color = 'black', label='Real Data')
plt.plot(target_train_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr[1:50], label='MLP Regressor')
plt.plot(target_train_predict_mlp[1:50], label='MLP')
plt.plot(target_train_predict_cnn[1:50], label='CNN')
plt.plot(target_train_predict_rnn[1:50], label='RNN')
plt.legend()
plt.title("48 Features; Train Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_test_predict_gbr[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr[1:50], label='MLP Regressor')
plt.plot(target_test_predict_mlp[1:50], label='MLP')
plt.plot(target_test_predict_cnn[1:50], label='CNN')
plt.plot(target_test_predict_rnn[1:50], label='RNN')
plt.legend()
plt.title("48 Features; Test Predictions");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_train2[1:50], color = 'black', label='Real Data')
plt.plot(target_train_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_train_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_train_predict_mlpr2[1:50], label='MLP Regressor')
plt.plot(target_train_predict_mlp2[1:50], label='MLP')
plt.plot(target_train_predict_cnn2[1:50], label='CNN')
plt.plot(target_train_predict_rnn2[1:50], label='RNN')
plt.legend()
plt.title("32 Features; Train Predictions vs Real Data");
hide_code
plt.figure(figsize = (18, 6))
plt.plot(target_test_predict_gbr2[1:50], label='Gradient Boosting Regressor')
plt.plot(target_test_predict_br2[1:50], label='Bagging Regressor')
plt.plot(target_test_predict_mlpr2[1:50], label='MLP Regressor')
plt.plot(target_test_predict_mlp2[1:50], label='MLP')
plt.plot(target_test_predict_cnn2[1:50], label='CNN')
plt.plot(target_test_predict_rnn2[1:50], label='RNN')
plt.legend()
plt.title("32 Features; Test Predictions");
The project was built on the basis of a competition offered on https://www.kaggle.com. The competition version of this notebook is available here: https://www.kaggle.com/olgabelitskaya/sberbank-russian-housing-market .
Several popular libraries for building regression models were used: numpy, pandas, matplotlib, scikit-learn, and keras.
The most valuable part of this project is the study of real data and the attempt to bring the accuracy of the predictions on it up to the threshold of 70-80 percent.
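Note that the Kaggle leaderboard for this competition scores submissions with RMSLE (root mean squared logarithmic error) rather than the metrics reported by the `scores` helpers above; a minimal sketch of that metric:

```python
import numpy as np

def rmsle(y_true, y_pred):
    # Root mean squared logarithmic error: penalizes relative rather than
    # absolute errors, so cheap and expensive apartments contribute on a
    # comparable scale.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle([1e6, 5e6], [1e6, 5e6]))   # a perfect prediction scores 0.0
```

Because this metric works on the logarithm of the price, it is consistent with the observation from the first histogram that the log-transformed prices are far closer to a normal distribution than the raw ones.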